AITopics | Norco

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Haas, Lukas, Yona, Gal, D'Antonio, Giovanni, Goldshtein, Sasha, Das, Dipanjan

arXiv.org Artificial Intelligence, Sep-10-2025

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
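The abstract reports an F1-score but does not define it. A minimal sketch, assuming the metric follows the original SimpleQA convention (harmonic mean of "overall correct" and "correct given attempted"); the function name and grade labels below are hypothetical and are not taken from the benchmark's evaluation code:

```python
from collections import Counter


def simpleqa_f1(grades):
    """Return an F1-style score from per-question autorater grades.

    Assumed (hypothetical) labels: 'correct', 'incorrect', 'not_attempted'.
    """
    counts = Counter(grades)
    total = sum(counts.values())
    attempted = counts["correct"] + counts["incorrect"]
    if total == 0 or attempted == 0 or counts["correct"] == 0:
        return 0.0

    # Share of all questions answered correctly.
    overall_correct = counts["correct"] / total
    # Share of attempted questions answered correctly.
    correct_given_attempted = counts["correct"] / attempted

    # Harmonic mean of the two rates.
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


if __name__ == "__main__":
    example = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
    print(round(simpleqa_f1(example), 4))
```

Under this assumed definition, declining to answer lowers "overall correct" but not "correct given attempted", so a model is penalized less for abstaining than for answering incorrectly.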

large language model, machine learning, natural language, (16 more...)

2509.07968

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
South America > Colombia (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
(7 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Government (0.69)
Media > Television (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)

Do Bayesian Neural Networks Improve Weapon System Predictive Maintenance?

Potter, Michael, Jun, Miru

arXiv.org Artificial Intelligence, Jan-6-2024

This approach lacks the extra information on individual systems with weapon system characteristics. A recent method introduced the Weibull-Cox Bayesian Neural Network tested on several weapon systems, albeit requiring a held-out validation set [7]. Moreover, while understanding the population reliability trends via a Weibull distribution is informative, this formulation does [...]

[...] interval-censored data and time-varying covariates. We analyze and benchmark our approach, LaplaceNN, on synthetic and real datasets with standard classification metrics such as Receiver Operating Characteristic (ROC) Area Under Curve (AUC), Precision-Recall (PR) AUC, and reliability curve visualizations.

approximation, dataset, weapon system, (13 more...)

2312.10494

Country:

North America > United States > California > Riverside County > Norco (0.05)
Asia > Pakistan (0.04)

Genre: Research Report (0.82)

Industry: Government > Military (0.99)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)

Bayesian Weapon System Reliability Modeling with Cox-Weibull Neural Network

Potter, Michael, Cheng, Benny

arXiv.org Artificial Intelligence, Apr-14-2023

We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull [1] reliability model via a neural network, like DeepSurv [2], to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and employing dropout methods such as Monte-Carlo (MC)-dropout for comparative purposes. Due to data collection procedures in weapon system testing, we employ a novel interval-censored log-likelihood which incorporates Markov Chain Monte Carlo (MCMC) [3] sampling of the Weibull parameters during gradient descent optimization. We compare classification metrics such as receiver operating characteristic (ROC) area under the curve (AUC), precision-recall (PR) AUC, and F scores to show that our model generally outperforms strong traditional models such as XGBoost and the current standard conditional Weibull probability density estimation model.
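To make the interval-censored log-likelihood concrete, here is a minimal sketch of the generic textbook form for a plain Weibull model: a failure known only to lie in (lower, upper] contributes log(S(lower) - S(upper)), where S(t) = exp(-(t/scale)^shape). This is not the authors' implementation. It omits the neural network, the covariates, and the MCMC sampling, and every name below is hypothetical.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def interval_censored_loglik(intervals, shape, scale, eps=1e-12):
    """Sum of log P(lower < T <= upper) over (lower, upper) pairs.

    Use upper = math.inf for a unit that had not failed by its last
    inspection (right-censored), and lower = 0.0 for a unit that had
    already failed at its first inspection (left-censored).
    """
    total = 0.0
    for lower, upper in intervals:
        prob = (weibull_survival(lower, shape, scale)
                - weibull_survival(upper, shape, scale))
        # Clamp to eps so a zero-probability interval does not yield log(0).
        total += math.log(max(prob, eps))
    return total


if __name__ == "__main__":
    # Two inspected units: one failed between t=1 and t=3, one survived past t=5.
    data = [(1.0, 3.0), (5.0, math.inf)]
    print(interval_censored_loglik(data, shape=1.5, scale=4.0))
```

In a model like the one described, the shape and scale would be outputs of a network evaluated per system rather than fixed constants, and the negative of this sum would serve as the training loss.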

artificial intelligence, machine learning, weapon system, (16 more...)

doi: 10.1109/RAMS51473.2023.10088222

2301.0185

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Riverside County > Norco (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.47)

Industry:

Health & Medicine (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.35)
